Titanic: Machine Learning from Disaster

Opening this file on a mobile phone may result in poorly rendered graphs.

Please make sure you are connected to the internet; some of the graphs are loaded from our Tableau Public account:

https://public.tableau.com/profile/lawrence8674#!/

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split 
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier 
from sklearn.svm import SVC 
from sklearn.decomposition import PCA
from xgboost.sklearn import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
import seaborn as sns
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
import math
import itertools
from sklearn.decomposition import PCA as sklearnPCA
import plotly.offline as pyo
import plotly.express as px

# Set notebook mode to work in offline
pyo.init_notebook_mode()

Helper Methods

Returns a pandas DataFrame listing each column's missing-value count and percentage.

In [3]:
def missing_value_of_data(data):
    total=data.isnull().sum().sort_values(ascending=False)
    percentage=round(total/data.shape[0]*100,2)
    return pd.concat([total,percentage],axis=1,keys=['Total','Percentage %'])
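As a quick sanity check, the helper can be exercised on a toy frame (the function body is repeated here so the sketch is self-contained; the values are illustrative, not from the Titanic data):

```python
import numpy as np
import pandas as pd

def missing_value_of_data(data):
    # count NaNs per column, most-missing first, with percentages
    total = data.isnull().sum().sort_values(ascending=False)
    percentage = round(total / data.shape[0] * 100, 2)
    return pd.concat([total, percentage], axis=1, keys=['Total', 'Percentage %'])

# toy frame with illustrative values
toy = pd.DataFrame({'Age': [22.0, np.nan, 38.0, np.nan],
                    'Sex': ['m', 'f', 'f', 'm']})
report = missing_value_of_data(toy)
print(report)  # Age: 2 missing (50.0 %), Sex: 0 missing (0.0 %)
```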

This method extracts the title from each name and replaces the Name feature with a Title feature.

In [4]:
def replace_names_with_titles(df):
    name_titles = []
    for name in df.Name:
        comma = name.find(",") + 1
        point = name.find(".",comma)
        name_titles.append(name[comma:point])
    df.Name = name_titles
    df.rename(columns = {'Name':'Title'}, inplace = True) 

Filling the missing Age values with the average age of each title group.

In [5]:
def filling_age_by_titles_avg(df):
    for title in set(df.Title):
        title_age_avg = df[df.Title == title ].Age.mean()
        
        # if all the Age values for this title are NaN, there is no group average, so fall back to the overall Age average
        if math.isnan(title_age_avg):
            title_age_avg = df.Age.mean()
        mask = df.Title == title
        df.loc[mask, 'Age'] = df.loc[mask, 'Age'].fillna(value=title_age_avg)

Method for deeper optimization by dropping features:

This function tries to enhance the model's accuracy by iteratively dropping the least important feature. It stops when the accuracy starts to drop or when the minimum number of features for the model is reached.

  • curr_train_x : train_x after dropping one feature at a time.
  • curr_test_x : test_x after dropping one feature at a time.
  • min_features : the function keeps dropping features until this number is reached.
  • best_score : the score of the best model.
  • isBetter : a flag indicating whether the current model, after dropping a feature, beats the best model so far.
  • curr_model : the model rebuilt each time a feature is dropped.
  • returns : best_model — the best model built after dropping the unnecessary features; excluded_features — a list of the features that were excluded from the model.
In [6]:
def optimize_by_droping_features(train_x,train_y,test_x,test_y,min_features,score,model,param_grid):
    
    curr_train_x = train_x
    curr_test_x  = test_x
    best_score   = score
    isBetter     = True
    best_model   = model
    excluded_features = []

    while isBetter :

        if(len(curr_train_x.columns) > min_features):

            features = list(curr_train_x.columns)
            importance_df = pd.DataFrame({'feature': features,
                                          'importance': best_model.feature_importances_}).\
                                           sort_values('importance', ascending = False)

            display(importance_df)
            print()

            least_important_feature = importance_df['feature'].iloc[-1]

            print('Trying To Drop Feature : '+ least_important_feature)

            curr_train_x = train_x.drop(least_important_feature, axis=1)
            curr_test_x  = test_x.drop(least_important_feature, axis=1)
            
            print('Optimizing The Best Model Without The Feature, Please Wait ⏳ ...')
            
            
            gs = GridSearchCV(best_model,param_grid = param_grid, cv = 5, n_jobs = -1,iid=False)
            gs.fit(curr_train_x,train_y)
            current_model = gs.best_estimator_
            pred_y = current_model.predict(curr_test_x)

            current_score = accuracy_score(test_y, pred_y)
            print("Accuracy After Dropping Feature : ", current_score)

            if(current_score >= best_score) :
                
                excluded_features.append(least_important_feature)
                print('Dropping Feature Was Efficient ✔️')
                print('excluded_features',excluded_features)
                print()
                
                train_x      = curr_train_x
                test_x       = curr_test_x
                best_score   = current_score
                best_model   = current_model
                pred_y       = best_model.predict(test_x)
                

            else:
                print("Dropping Feature Was Not Efficient ❌")
                print()
                print("Optimization By Droping Features Is Done")
                print('excluded_features',excluded_features)
                print()
                isBetter = False
        else :
            print("Minimum Number Of Features Is Reached ⚠️")
            print("Optimization By Droping Features Is Done")
            print('excluded_features',excluded_features)
            print()
            isBetter = False
    print('The Best Score For Model is : ',best_score)       
    return best_model , excluded_features

Trains a given model on the combined train+validation set and predicts on the test set. This method is called after the models have been built.

In [7]:
def final_train_and_predict(model,train_x,train_y,validation_x,validation_y,test,excluded):
    
    train_x = pd.concat([train_x, validation_x])  ## train + validation
    train_x = train_x.drop(excluded, axis=1)  ## removing features that were excluded, if any
    train_y = pd.concat([train_y, validation_y])
    model.fit(train_x, train_y)  ## train the model on train + validation
    pred_y = model.predict(test.drop(excluded, axis=1))

    return pred_y

Exports the prediction results to a .csv file to be uploaded to Kaggle.

In [8]:
def export_to_csv(predictions, passengers_id, file_name):
    output = pd.DataFrame({'PassengerId': list(passengers_id), 'Survived': list(predictions)}, columns=['PassengerId', 'Survived'])
    output.to_csv(file_name + ".csv", index=False)

Displays the correlation between features as a heatmap.

In [9]:
def view_correlation(df):
    corr=df.corr()
    plt.figure(figsize=(9,9))
    sns.heatmap(corr, linewidths=0.01,square=True,annot=True,annot_kws={"size": 12},cmap='YlGnBu',linecolor="white" )
    sns.set(font_scale=0.8)
    plt.title('Correlation between features parameters matrix');
    # workaround for the matplotlib 3.1.1 bug that clips the top and bottom heatmap rows
    b, t = plt.ylim()
    b += 0.5
    t -= 0.5
    plt.ylim(b, t)
    plt.show() 

This function prints and plots the confusion matrix.

In [10]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Oranges):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.figure(figsize = (10, 10))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, size = 24)
    plt.colorbar(aspect=4)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45, size = 14)
    plt.yticks(tick_marks, classes, size = 14)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    
    # Labeling the plot
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), fontsize = 20,
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    # workaround for the matplotlib 3.1.1 bug that clips the top and bottom rows
    b, t = plt.ylim()
    b += 0.5
    t -= 0.5
    plt.ylim(b, t)
    plt.grid(None)
    plt.tight_layout()
    plt.ylabel('True label', size = 18)
    plt.xlabel('Predicted label', size = 18)

The elbow method is a heuristic for determining the number of clusters in a data set. It consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.

In [11]:
def elbow_met(df):
    distortions = []
    K = range(1,10)
    for k in K:
        kmeanModel = KMeans(n_clusters=k).fit(df)
        distortions.append(sum(np.min(cdist(df, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / df.shape[0])

    # Plot the elbow
    plt.figure(figsize=(10,7))
    plt.plot(K, distortions, 'bx-')
    plt.xlabel('k')
    plt.ylabel('Within groups sum of squares')
    plt.title('The Elbow Method showing the optimal k')
    plt.show()

1. Reading The Data

In [12]:
# local file paths; adjust them to your own environment
train = pd.read_csv(r'C:\Users\L.A\Desktop\titanic\train.csv', encoding="ISO-8859-8")
test  = pd.read_csv(r'C:\Users\L.A\Desktop\titanic\test.csv', encoding="ISO-8859-8")
passengers_id = test.PassengerId  # kept for the real-life test / submission file later

2+3. Data Engineering and Visualization

In [13]:
%%HTML
<br>
<div class='tableauPlaceholder' id='viz1592437685721' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;ML&#47;ML_embarked&#47;Sheet1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='ML_embarked&#47;Sheet1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;ML&#47;ML_embarked&#47;Sheet1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1592437685721');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.55)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>
<div class='tableauPlaceholder' id='viz1592437607255' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;ML&#47;ML_embarked&#47;Sheet3&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='ML_embarked&#47;Sheet3' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;ML&#47;ML_embarked&#47;Sheet3&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1592437607255');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.45)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>
<br>


Survival by Age, Fare, and Parch

In [14]:
fig = px.scatter_3d(train, x='Age', y='Fare', z='Parch',
              color='Survived')
fig.show()

Checking for missing values

In [17]:
missing_value_of_data(train)
Out[17]:
Total Percentage %
Cabin 687 77.10
Age 177 19.87
Embarked 2 0.22
Fare 0 0.00
Ticket 0 0.00
Parch 0 0.00
SibSp 0 0.00
Sex 0 0.00
Name 0 0.00
Pclass 0 0.00
Survived 0 0.00
PassengerId 0 0.00

Embarked is a categorical feature. Since it has only 2 missing values, I decided to drop those rows instead of trying to guess their values.

In [18]:
train = train[(train.Embarked.notnull())]

Cabin is a categorical feature. I decided to drop it entirely, since more than 50% of its values are empty.

In [19]:
# Train
train.drop('Cabin', axis=1, inplace=True)

# Test
test.drop('Cabin', axis=1, inplace=True)

Ticket and PassengerId are unique for each passenger, so I will drop them; they will not help us find patterns in the data.

In [20]:
# Train
train.drop('Ticket', axis=1, inplace=True)
train.drop('PassengerId', axis=1, inplace=True)

# Test
test.drop('Ticket', axis=1, inplace=True)
test.drop('PassengerId', axis=1, inplace=True)

Extracting the title from the Name field and using it as a categorical feature, for 2 reasons:

  • we can use the titles to fill the missing ages with each category's average (Mr, Mrs, Lady, ...); for example, a row with 'Mr' and a NaN Age will get the average age of all 'Mr' rows
  • having more features available to the model might help it find patterns in the data
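The extraction can be checked on a single sample name (an illustrative string in the dataset's "Surname, Title. Given names" format, not read from the file). Note that the slice keeps the space after the comma; this is harmless because it is consistent across all rows:

```python
# sample name in the "Surname, Title. Given names" format (illustrative)
name = "Braund, Mr. Owen Harris"
comma = name.find(",") + 1     # index just after the comma
point = name.find(".", comma)  # index of the period that ends the title
title = name[comma:point]
print(repr(title))  # -> ' Mr' (leading space kept)
```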
In [21]:
replace_names_with_titles(train)
In [22]:
replace_names_with_titles(test)
In [23]:
%%html
<br>
<div class='tableauPlaceholder' id='viz1592587025547' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Bo&#47;Book2_15925864916510&#47;Sheet1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Book2_15925864916510&#47;Sheet1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Bo&#47;Book2_15925864916510&#47;Sheet1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1592587025547');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.55)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>
<br>


Filling the missing Age values according to the average of each title category.

In [24]:
filling_age_by_titles_avg(train)
In [25]:
filling_age_by_titles_avg(test)

Survival by Age and Fare

In [27]:
f = plt.figure(figsize=(20,6))
sns.scatterplot(x='Age',y='Fare',hue='Survived',data=train)
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x18ac5d23688>

Age distribution :

In [28]:
f = plt.figure(figsize=(20,6))
f.add_subplot(1,2,1)
sns.distplot(train['Age'])
f.add_subplot(1,2,2)
sns.boxplot(train['Age'])
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x18ac5f90488>
  • From the distplot, it can be seen that the density of the data lies in the range of 20–40 years
  • The boxplot shows that the data has a few outliers
In [29]:
missing_value_of_data(test)
Out[29]:
Total Percentage %
Fare 1 0.24
Embarked 0 0.00
Parch 0 0.00
SibSp 0 0.00
Age 0 0.00
Sex 0 0.00
Title 0 0.00
Pclass 0 0.00

Filling the empty Fare cell with the mean (only 1 missing value).

In [30]:
test.Fare=test.Fare.fillna(test.Fare.mean())

Converting string values to numbers

In [31]:
le = preprocessing.LabelEncoder()

#Train
train.Embarked = le.fit_transform(train.Embarked)
train.Sex = le.fit_transform(train.Sex)
train.Title = le.fit_transform(train.Title)

#Test
test.Embarked = le.fit_transform(test.Embarked)
test.Sex = le.fit_transform(test.Sex)
test.Title = le.fit_transform(test.Title)
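One caveat with the cell above: each `fit_transform` call refits the encoder, so train and test only receive the same integer codes when both sets contain exactly the same category values. A more defensive pattern (a sketch with hypothetical mini frames, not the notebook's original code) fits one encoder on the combined values and then only transforms:

```python
import pandas as pd
from sklearn import preprocessing

# hypothetical mini train/test frames standing in for the Titanic columns
train_df = pd.DataFrame({'Sex': ['male', 'female', 'male']})
test_df  = pd.DataFrame({'Sex': ['female', 'male']})

le = preprocessing.LabelEncoder()
# fit on the union of values so both sets share one code table
le.fit(pd.concat([train_df.Sex, test_df.Sex]))
train_df.Sex = le.transform(train_df.Sex)
test_df.Sex  = le.transform(test_df.Sex)
print(list(train_df.Sex), list(test_df.Sex))  # female -> 0, male -> 1 in both sets
```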
In [32]:
real_life_test = test.copy()

2d data visualization (PCA)

In [35]:
d2 = pd.DataFrame(sklearnPCA(n_components=2).fit_transform(train))
plt.figure(figsize=(30,10))
plt.scatter(d2.iloc[:, 0], d2.iloc[:, 1], s=50 );
plt.title('2d data visualization')
Out[35]:
Text(0.5, 1.0, '2d data visualization')

Checking the correlation between the features

In [36]:
view_correlation(train.drop('Survived', axis=1))

Since there is no perfect correlation between any pair of features, we will start by building the models with all of the features.
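The eyeball check of the heatmap can also be done programmatically: take the absolute correlation matrix, keep the upper triangle so each pair appears once, and list any pair above a threshold. A minimal sketch on a toy frame (illustrative values; column `b` is exactly `2*a`):

```python
import numpy as np
import pandas as pd

# toy frame standing in for the feature columns
df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [2, 4, 6, 8],
                   'c': [4, 1, 3, 2]})

corr = df.corr().abs()
# keep only the strict upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
s = upper.stack()        # drops the NaNs of the lower triangle
high = s[s > 0.9]        # strongly correlated pairs
print(high)  # only ('a', 'b') exceeds the threshold, with correlation 1.0
```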

In [37]:
train.describe()
Out[37]:
Survived Pclass Title Sex Age SibSp Parch Fare Embarked
count 889.000000 889.000000 889.000000 889.000000 889.000000 889.000000 889.000000 889.000000 889.000000
mean 0.382452 2.311586 10.241845 0.649044 29.699916 0.524184 0.382452 32.096681 1.535433
std 0.486260 0.834700 1.830363 0.477538 13.245631 1.103705 0.806761 49.697504 0.792088
min 0.000000 1.000000 0.000000 0.000000 0.420000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 8.000000 0.000000 21.662069 0.000000 0.000000 7.895800 1.000000
50% 0.000000 3.000000 11.000000 1.000000 30.000000 0.000000 0.000000 14.454200 2.000000
75% 1.000000 3.000000 11.000000 1.000000 35.654206 1.000000 0.000000 31.000000 2.000000
max 1.000000 3.000000 16.000000 1.000000 80.000000 8.000000 6.000000 512.329200 2.000000

4. Prediction Models

In [38]:
RSEED = 10

Clustering The Data

In [39]:
elbow_met(train[['Age','Fare','SibSp','Parch']])

We will set k = 3, as that is where the elbow occurs.

In [40]:
train_with_cluster = train.copy()
real_life_test_with_cluster = real_life_test.copy()
In [41]:
k = 3
kmeans = KMeans(n_clusters = k).fit(train[['Age','Fare','SibSp','Parch']])
train_with_cluster['cluster'] = kmeans.labels_  # add the clusters

# cluster the test set on the same four features used for the training set
kmeans = KMeans(n_clusters = k).fit(real_life_test[['Age','Fare','SibSp','Parch']])
real_life_test_with_cluster['cluster'] = kmeans.labels_  # add the clusters

Train / Test split

In [43]:
feature_cols = ['Pclass','Title','Sex', 'Age','SibSp','Parch','Fare','Embarked']
x = train[feature_cols] 
y = train.Survived
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.3, random_state=1)

# with the cluster as a feature
feature_cols = ['Pclass','Title','Sex', 'Age','SibSp','Parch','Fare','Embarked','cluster']
x = train_with_cluster[feature_cols]
y = train_with_cluster.Survived
# same random_state and row order as above, so train_y/test_y are reused
train_x_with_cluster, test_x_with_cluster, _, _ = train_test_split(x, y, test_size=0.3, random_state=1)

Decision Tree

without clustering :

The base model

In [174]:
# Creating a Decision Tree Classifier
tree = DecisionTreeClassifier(random_state=RSEED)
tree.fit(train_x, train_y)
pred_y = tree.predict(test_x)
score = accuracy_score(pred_y,test_y)
print('Accuracy : ',score)
Accuracy :  0.7602996254681648

The tuned model

In [175]:
# Hyperparameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],  # whether the split criterion of the tree is gini or entropy
    'max_depth': [3,5,8,15,20,30,40,50,60],  # the maximum depth of the tree
    'max_features': ['sqrt', None,1,2,3,4,5,6,7],  # the number of features to consider when looking for the best split
    'min_samples_split': [2,4,6,8,10,12,15],  # the minimum number of samples required to split an internal node
}
    
# Create a grid search object
gs = GridSearchCV(tree, param_grid, cv=5, scoring='accuracy',iid=False)

# Fit the grid search
gs.fit(train_x, train_y)

pred_y = gs.predict(test_x)
score = accuracy_score(pred_y,test_y)

print('Accuracy : ',score)
Accuracy :  0.850187265917603

Dropping features

In [176]:
model=gs.best_estimator_
best_model,excluded_features = optimize_by_droping_features(train_x,train_y,test_x,test_y,3,score,model,param_grid)
feature importance
2 Sex 0.564129
0 Pclass 0.166153
3 Age 0.097618
4 SibSp 0.068516
6 Fare 0.065466
7 Embarked 0.038118
1 Title 0.000000
5 Parch 0.000000
Trying To Drop Feature : Parch
Optimizing The Best Model Without The Feature, Please Wait ⏳ ...
Accuracy After Dropping Feature :  0.8089887640449438
Dropping Feature Was Not Efficient ❌

Optimization By Droping Features Is Done
excluded_features []

The Best Score For Model is :  0.850187265917603

final train and export to .csv

In [177]:
pred_y = final_train_and_predict(model,train_x,train_y,test_x,test_y,real_life_test,excluded_features)
export_to_csv(pred_y,passengers_id,'DecisionTreeWithoutCluster')

with clustering :

The base model

In [178]:
# Creating a Decision Tree Classifier
tree = DecisionTreeClassifier(random_state=RSEED)
tree.fit(train_x_with_cluster, train_y)
pred_y = tree.predict(test_x_with_cluster)
score = accuracy_score(pred_y,test_y)
print('Accuracy : ',score)
Accuracy :  0.7602996254681648

The tuned model

In [179]:
# Create a grid search object
gs = GridSearchCV(tree, param_grid, cv=5, scoring='accuracy',iid=False)

# Fit the grid search
gs.fit(train_x_with_cluster, train_y)

pred_y = gs.predict(test_x_with_cluster)
score = accuracy_score(pred_y,test_y)

print('Accuracy : ',score)
Accuracy :  0.8089887640449438

Dropping features

In [180]:
model=gs.best_estimator_
best_model,excluded_features = optimize_by_droping_features(train_x_with_cluster,train_y,test_x_with_cluster,test_y,3,score,model,param_grid)
feature importance
2 Sex 0.435059
6 Fare 0.225088
0 Pclass 0.088208
4 SibSp 0.079168
3 Age 0.077649
7 Embarked 0.042361
5 Parch 0.034602
1 Title 0.017865
8 cluster 0.000000
Trying To Drop Feature : cluster
Optimizing The Best Model Without The Feature, Please Wait ⏳ ...
Accuracy After Dropping Feature :  0.850187265917603
Dropping Feature Was Efficient ✔️
excluded_features ['cluster']

feature importance
2 Sex 0.564129
0 Pclass 0.166153
3 Age 0.097618
4 SibSp 0.068516
6 Fare 0.065466
7 Embarked 0.038118
1 Title 0.000000
5 Parch 0.000000
Trying To Drop Feature : Parch
Optimizing The Best Model Without The Feature, Please Wait ⏳ ...
Accuracy After Dropping Feature :  0.8089887640449438
Dropping Feature Was Not Efficient ❌

Optimization By Droping Features Is Done
excluded_features ['cluster']

The Best Score For Model is :  0.850187265917603

final train and export to .csv

In [181]:
pred_y = final_train_and_predict(model,train_x_with_cluster,train_y,test_x_with_cluster,test_y,real_life_test_with_cluster,excluded_features)
export_to_csv(pred_y,passengers_id,'DecisionTreeWithCluster')

Random Forest

without clustering :

The base model

In [99]:
rfc = RandomForestClassifier(n_estimators = 300 ,random_state = RSEED ,min_samples_split = 5)
rfc = rfc.fit(train_x,train_y)
pred_y = rfc.predict(test_x)
score = accuracy_score(pred_y,test_y)
print('Accuracy : ',score)
Accuracy :  0.8352059925093633

The tuned model

In [100]:
# Hyperparameter grid
param_grid = {
    'n_estimators': [10,100, 200, 250] ,#The number of trees in the forest.
    'max_depth':    [None, 50, 60, 70] ,#The maximum depth of the tree.
    'max_features': ['sqrt', None],#he number of features to consider when looking for the best split
    'min_samples_split': [2, 10],#The minimum number of samples required to split an internal node
    'bootstrap': [True, False]#Whether bootstrap samples are used when building trees.
}
rs = GridSearchCV(rfc, param_grid, n_jobs = -1,scoring = 'accuracy', cv = 5 ,iid=False)

rs.fit(train_x,train_y)
pred_y=rs.predict(test_x)
score = accuracy_score(pred_y,test_y)
print('Accuracy : ',score)
Accuracy :  0.850187265917603

Dropping features

In [101]:
model=rs.best_estimator_
best_model,excluded_features = optimize_by_droping_features(train_x,train_y,test_x,test_y,3,score,model,param_grid)
feature importance
6 Fare 0.220919
2 Sex 0.219404
3 Age 0.176580
1 Title 0.150512
0 Pclass 0.097118
4 SibSp 0.053249
7 Embarked 0.042907
5 Parch 0.039312
Trying To Drop Feature : Parch
Optimizing The Best Model Without The Feature, Please Wait ⏳ ...
Accuracy After Dropping Feature :  0.8689138576779026
Dropping Feature Was Efficient ✔️
excluded_features ['Parch']

feature importance
2 Sex 0.292201
5 Fare 0.258310
3 Age 0.189801
0 Pclass 0.092162
1 Title 0.079377
4 SibSp 0.054084
6 Embarked 0.034065
Trying To Drop Feature : Embarked
Optimizing The Best Model Without The Feature, Please Wait ⏳ ...
Accuracy After Dropping Feature :  0.8614232209737828
Dropping Feature Was Not Efficient ❌

Optimization By Droping Features Is Done
excluded_features ['Parch']

The Best Score For Model is :  0.8689138576779026

final train and export

In [ ]:
pred_y = final_train_and_predict(best_model,train_x,train_y,test_x,test_y,real_life_test,excluded_features)
export_to_csv(pred_y,passengers_id,'RandomForestWithoutCluster')

with clustering :

The base model

In [54]:
rfc_with_cluster = RandomForestClassifier(n_estimators = 300 ,random_state = RSEED ,min_samples_split = 5)
rfc_with_cluster = rfc_with_cluster.fit(train_x_with_cluster,train_y)
pred_y = rfc_with_cluster.predict(test_x_with_cluster)
score = accuracy_score(pred_y,test_y)
print('Accuracy : ',score)
Accuracy :  0.8352059925093633

The tuned model

In [55]:
rs = GridSearchCV(RandomForestClassifier(random_state = RSEED), param_grid, n_jobs = -1,scoring = 'accuracy', cv = 5 ,iid=False)
rs.fit(train_x_with_cluster,train_y)
pred_y=rs.predict(test_x_with_cluster)
score = accuracy_score(pred_y,test_y)
print('Accuracy : ',score)
Accuracy :  0.8352059925093633

Dropping features

In [48]:
model=rs.best_estimator_
best_model,excluded_features=optimize_by_droping_features(train_x_with_cluster,train_y,test_x_with_cluster,test_y,3,score,model,param_grid)
feature importance
2 Sex 0.246645
6 Fare 0.181178
1 Title 0.154972
3 Age 0.152202
0 Pclass 0.094096
4 SibSp 0.061646
7 Embarked 0.049400
8 cluster 0.033421
5 Parch 0.026440
Trying To Drop Feature : Parch
Optimizing The Best Model Without The Feature, Please Wait ⏳ ...
Accuracy After Dropping Feature :  0.850187265917603
Dropping Feature Was Efficient ✔️
excluded_features ['Parch']

feature importance
2 Sex 0.224288
5 Fare 0.218874
3 Age 0.182412
1 Title 0.152969
0 Pclass 0.099581
4 SibSp 0.052285
6 Embarked 0.041444
7 cluster 0.028147
Trying To Drop Feature : cluster
Optimizing The Best Model Without The Feature, Please Wait ⏳ ...
Accuracy After Dropping Feature :  0.8689138576779026
Dropping Feature Was Efficient ✔️
excluded_features ['Parch', 'cluster']

feature importance
2 Sex 0.292201
5 Fare 0.258310
3 Age 0.189801
0 Pclass 0.092162
1 Title 0.079377
4 SibSp 0.054084
6 Embarked 0.034065
Trying To Drop Feature : Embarked
Optimizing The Best Model Without The Feature, Please Wait ⏳ ...
Accuracy After Dropping Feature :  0.8614232209737828
Dropping Feature Was Not Efficient ❌

Optimization By Droping Features Is Done
excluded_features ['Parch', 'cluster']

The Best Score For Model is :  0.8689138576779026

Conclusion for random forest: adding the cluster as a feature did not improve the score, so we will use the model without the cluster to predict the real test set.

final train and export to .csv

In [50]:
pred_y = final_train_and_predict(best_model,train_x_with_cluster,train_y,test_x_with_cluster,test_y,real_life_test_with_cluster,excluded_features)
export_to_csv(pred_y,passengers_id,'RandomForestWithCluster')

AdaBoost

without clustering :

The base model

In [56]:
adb = AdaBoostClassifier()
adb.fit(train_x,train_y)
pred_y = adb.predict(test_x)
score = accuracy_score(pred_y,test_y)
print('Accuracy : ',score)
Accuracy :  0.8314606741573034

The tuned model

In [57]:
adb_param_grid = {'n_estimators' : [10,20,50,100,200,300,400,500,1000], 
                  'learning_rate': [0.001,0.01,0.03 ,0.05, 0.07,0.1,0.5, 1],
                  'algorithm'    : ['SAMME.R'],
                  'random_state' : [RSEED] }

gs = GridSearchCV(adb,param_grid = adb_param_grid, cv = 2, n_jobs = -1,iid=False)
best_adb = gs.fit(train_x,train_y)
pred_y = best_adb.predict(test_x)
score = accuracy_score(pred_y,test_y)
print('Accuracy : ',score)
Accuracy :  0.8389513108614233

Dropping features

In [58]:
model = gs.best_estimator_
best_model,excluded_features=optimize_by_droping_features(train_x,train_y,test_x,test_y,3,score,model,adb_param_grid)
feature importance
2 Sex 0.217
6 Fare 0.191
1 Title 0.153
3 Age 0.138
4 SibSp 0.109
0 Pclass 0.080
7 Embarked 0.059
5 Parch 0.053
Trying To Drop Feature : Parch
Optimizing The Best Model Without The Feature, Please Wait ⏳ ...
Accuracy After Dropping Feature :  0.8202247191011236
Dropping Feature Was Not Efficient ❌

Optimization By Droping Features Is Done
excluded_features []

The Best Score For Model is :  0.8389513108614233

final train and export to .csv

In [59]:
pred_y = final_train_and_predict(best_model,train_x,train_y,test_x,test_y,real_life_test,excluded_features)
export_to_csv(pred_y,passengers_id,'AdaboostWithoutCluster')

with clustering :

The base model

In [66]:
adb = AdaBoostClassifier()
adb.fit(train_x_with_cluster,train_y)
pred_y = adb.predict(test_x_with_cluster)
score = accuracy_score(pred_y,test_y)
print('Accuracy : ',score)
Accuracy :  0.8314606741573034

The tuned model

In [67]:
adb_param_grid = {'n_estimators' : [10,20,50,100,200,300,400,500,1000], 
                  'learning_rate': [0.001,0.01,0.03 ,0.05, 0.07,0.1,0.5, 1],
                  'algorithm'    : ['SAMME.R'],
                  'random_state' : [RSEED] }

gs = GridSearchCV(adb,param_grid = adb_param_grid, cv = 2, n_jobs = -1,iid=False)
best_adb = gs.fit(train_x_with_cluster,train_y)
pred_y = best_adb.predict(test_x_with_cluster)
score = accuracy_score(pred_y,test_y)
print('Accuracy : ',score)
Accuracy :  0.8389513108614233

Dropping features

In [69]:
model = gs.best_estimator_
best_model,excluded_features=optimize_by_droping_features(train_x_with_cluster,train_y,test_x_with_cluster,test_y,3,score,model,adb_param_grid)
feature importance
2 Sex 0.217
6 Fare 0.191
1 Title 0.153
3 Age 0.138
4 SibSp 0.109
0 Pclass 0.080
7 Embarked 0.059
5 Parch 0.053
8 cluster 0.000
Trying To Drop Feature : cluster
Optimizing The Best Model Without The Feature, Please Wait ⏳ ...
Accuracy After Dropping Feature :  0.8202247191011236
Dropping Feature Was Not Efficient ❌

Optimization By Droping Features Is Done
excluded_features []

The Best Score For Model is :  0.8389513108614233

final train and export to .csv

In [70]:
pred_y = final_train_and_predict(best_model,train_x_with_cluster,train_y,test_x_with_cluster,test_y,real_life_test_with_cluster,excluded_features)
export_to_csv(pred_y,passengers_id,'AdaboostWithCluster')

KNN

In [72]:
knn = KNeighborsClassifier(n_neighbors=10, metric='euclidean')
knn.fit(train_x, train_y)
pred_y = knn.predict(test_x)
score = accuracy_score(pred_y,test_y)
print('Accuracy : ', score)
Accuracy :  0.7078651685393258
In [73]:
view_correlation(train.drop('Survived',1))

there is a high correlation between Fare and Pclass, so we will build a model without each of them in turn and compare accuracies:

  • building model without Pclass feature
In [74]:
knn = KNeighborsClassifier(n_neighbors=10, metric='euclidean')
knn.fit(train_x.drop('Pclass',1), train_y)
pred_y = knn.predict(test_x.drop('Pclass',1))
score = accuracy_score(pred_y,test_y)
print('Accuracy : ', score)
Accuracy :  0.704119850187266
  • building model without Fare feature
In [75]:
knn = KNeighborsClassifier(n_neighbors=10, metric='euclidean')
knn.fit(train_x.drop('Fare',1), train_y)
pred_y = knn.predict(test_x.drop('Fare',1))
score = accuracy_score(pred_y,test_y)
print('Accuracy : ', score)
Accuracy :  0.7790262172284644
  • the base model with all features : 0.707

  • the base model without Pclass : 0.704 (-0.003)

  • the base model without Fare : 0.779 (+0.072)

  • we will build the models without Fare
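The drop-one-feature-at-a-time comparison above generalizes to a loop over all features. A minimal sketch on synthetic data (the column names are stand-ins for the notebook's real features):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=10)
X = pd.DataFrame(X, columns=['Pclass', 'Sex', 'Age', 'SibSp', 'Fare'])
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=10)

# Fit one model per left-out feature and record test accuracy.
scores = {}
for feature in train_x.columns:
    knn = KNeighborsClassifier(n_neighbors=10, metric='euclidean')
    knn.fit(train_x.drop(columns=feature), train_y)
    pred = knn.predict(test_x.drop(columns=feature))
    scores[feature] = accuracy_score(test_y, pred)

for feature, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f'without {feature}: {score:.3f}')
```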

without clustering :

The Tuned model

In [76]:
#List Hyperparameters that we want to tune.
leaf_size = list(range(1,50))
n_neighbors = list(range(1,50))
p = [1,2,3,4]

#Convert to dictionary
hyperparameters = dict(leaf_size=leaf_size, n_neighbors=n_neighbors, p=p)

#Use GridSearch
knn = KNeighborsClassifier()
clf = GridSearchCV(knn, hyperparameters, cv = 10, iid=False)

#Fit the model
best_model = clf.fit(train_x.drop('Fare',1), train_y)

#Print The value of best Hyperparameters
print('Best leaf_size:', best_model.best_estimator_.get_params()['leaf_size'])
print('Best p:', best_model.best_estimator_.get_params()['p'])
print('Best n_neighbors:', best_model.best_estimator_.get_params()['n_neighbors'])
Best leaf_size: 3
Best p: 1
Best n_neighbors: 5
In [77]:
pred_y = best_model.predict(test_x.drop('Fare',1))
score = accuracy_score(pred_y, test_y)
print('Accuracy : ', score)
Accuracy :  0.8127340823970037
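The tuned `p` is the Minkowski distance power: p=1 gives Manhattan distance, p=2 gives Euclidean. A quick worked example of the difference:

```python
import numpy as np
from scipy.spatial.distance import minkowski

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, p=1))  # 7.0  (|3| + |4|)
print(minkowski(a, b, p=2))  # 5.0  (sqrt(9 + 16))
```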

final train and export to .csv

In [78]:
pred_y = final_train_and_predict(best_model,train_x,train_y,test_x,test_y,real_life_test,['Fare'])
export_to_csv(pred_y,passengers_id,'KnnWithoutCluster')

With Clustering

The base model

In [80]:
knn = KNeighborsClassifier(n_neighbors=10, metric='euclidean')
knn.fit(train_x_with_cluster.drop('Fare',1), train_y)
pred_y = knn.predict(test_x_with_cluster.drop('Fare',1))
score = accuracy_score(pred_y,test_y)
print('Accuracy : ', score)
Accuracy :  0.7715355805243446

The Tuned model

In [81]:
#Use GridSearch
knn = KNeighborsClassifier()
clf = GridSearchCV(knn, hyperparameters, cv=10,iid=False)

#Fit the model
best_model = clf.fit(train_x_with_cluster.drop('Fare',1),train_y)

#Print The value of best Hyperparameters
print('Best leaf_size:', best_model.best_estimator_.get_params()['leaf_size'])
print('Best p:', best_model.best_estimator_.get_params()['p'])
print('Best n_neighbors:', best_model.best_estimator_.get_params()['n_neighbors'])

score = accuracy_score(best_model.predict(test_x_with_cluster.drop('Fare',1)),test_y)

print()
print('Accuracy : ', score)
Best leaf_size: 3
Best p: 1
Best n_neighbors: 9

Accuracy :  0.8239700374531835

final train and export to .csv

In [82]:
pred_y = final_train_and_predict(best_model,train_x_with_cluster,train_y,test_x_with_cluster,test_y,real_life_test_with_cluster,['Fare'])
export_to_csv(pred_y,passengers_id,'KnnWithCluster')

SVM

without clustering :

The Base model

In [69]:
SVM = SVC(random_state=RSEED,gamma='auto')
SVM.fit(train_x,train_y)
pred_y = SVM.predict(test_x)
score = accuracy_score(pred_y,test_y)
print("Accuracy : ", score)
Accuracy :  0.7191011235955056
In [70]:
SVM
Out[70]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=10, shrinking=True, tol=0.001,
    verbose=False)

The Tuned model

In [71]:
param_grid = {'kernel': ['linear', 'rbf'],
              'degree': [2,3,6,7], 
              'gamma' : [0.00001, 0.01, 0.1, 1],
              'C'     : [1, 10, 1000]}
grid = GridSearchCV(SVC(random_state=RSEED),param_grid,cv=3,verbose=0,iid=False) # SVC's first positional arg is C, so pass random_state by keyword
In [72]:
grid.fit(train_x,train_y)
grid.best_estimator_
Out[72]:
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=2, gamma=1e-05, kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
In [73]:
pred_y = grid.predict(test_x)
score = accuracy_score(pred_y,test_y)
print("Accuracy : ", score)
Accuracy :  0.8202247191011236
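Note that `gamma` and `degree` only affect the 'rbf'/'poly' kernels and are ignored by 'linear', so the grid above evaluates duplicate linear candidates. GridSearchCV accepts a list of dicts to keep each kernel's parameter space separate — a sketch on synthetic data, assuming the same value ranges:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=10)

# Separate sub-grids: linear ignores gamma/degree, so don't sweep them there.
param_grid = [
    {'kernel': ['linear'], 'C': [1, 10, 1000]},
    {'kernel': ['rbf'], 'C': [1, 10, 1000], 'gamma': [1e-5, 0.01, 0.1, 1]},
]
grid = GridSearchCV(SVC(random_state=10), param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

This cuts the linear candidates from 16 gamma/degree duplicates down to one each.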

final train and export to .csv

In [74]:
pred_y = final_train_and_predict(grid.best_estimator_,train_x,train_y,test_x,test_y,real_life_test,[])
export_to_csv(pred_y,passengers_id,'SvmWithoutCluster')

with clustering :

The Base model

In [75]:
SVM = SVC(random_state=RSEED,gamma='auto')
SVM.fit(train_x_with_cluster,train_y)
pred_y = SVM.predict(test_x_with_cluster)
score = accuracy_score(pred_y,test_y)
print("Accuracy : ", score)
Accuracy :  0.7228464419475655

The Tuned model

In [76]:
grid = GridSearchCV(SVC(random_state=RSEED),param_grid,cv=3,verbose=0,iid=False) # pass random_state by keyword, not positionally
grid.fit(train_x_with_cluster,train_y)
print(grid.best_estimator_)
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=2, gamma=1e-05, kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
In [77]:
pred_y = grid.predict(test_x_with_cluster)
score = accuracy_score(pred_y,test_y)
print("Accuracy : ", score)
Accuracy :  0.8202247191011236

final train and export to .csv

In [78]:
pred_y = final_train_and_predict(grid.best_estimator_,train_x_with_cluster,train_y,test_x_with_cluster,test_y,real_life_test_with_cluster,[])
export_to_csv(pred_y,passengers_id,'SvmWithCluster')

Neural networks

without clustering :

The Base model

In [44]:
#defining ann classifier using sklearn's MLPClassifier method
ann = MLPClassifier(random_state=RSEED)
# fitting model.
ann.fit(train_x, train_y)
Out[44]:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_iter=200, momentum=0.9,
              n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
              random_state=10, shuffle=True, solver='adam', tol=0.0001,
              validation_fraction=0.1, verbose=False, warm_start=False)
In [45]:
# making predictions
pred_y = ann.predict(test_x)
In [46]:
accuracy = accuracy_score(test_y, pred_y)
print("Accuracy: " + str(accuracy))
Accuracy: 0.8014981273408239

The Tuned model

In [51]:
# using sklearn's method GridSearchCV

param_grid = {'batch_size': [8,16,18,20],
               'solver': ['sgd','adam'],
               'hidden_layer_sizes': [(8,8,8),(8,8),(8,12),(12),(12,12),(8,12,2)],
               'random_state' : [RSEED],
               'alpha' : [0.1,0.01,1],
               'max_iter':[500]}

ann_gs = GridSearchCV(ann,param_grid,cv=5, iid=False)

ann_gs.fit(train_x, train_y)

pred_y = ann_gs.predict(test_x)
accuracy = accuracy_score(test_y, pred_y)
print("Accuracy: " + str(accuracy))
Accuracy: 0.8164794007490637
In [52]:
print(ann_gs.best_params_)
{'alpha': 0.01, 'batch_size': 20, 'hidden_layer_sizes': (12, 12), 'max_iter': 500, 'random_state': 10, 'solver': 'adam'}
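One knob the grid above does not explore is MLPClassifier's built-in `early_stopping`, which holds out `validation_fraction` of the training data and stops when the validation score stops improving — a cheap guard against overfitting. A minimal sketch on synthetic data, reusing the tuned layer sizes and alpha:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, random_state=10)

ann = MLPClassifier(hidden_layer_sizes=(12, 12), alpha=0.01,
                    early_stopping=True, validation_fraction=0.1,
                    n_iter_no_change=10, max_iter=500, random_state=10)
ann.fit(X, y)
print('stopped after', ann.n_iter_, 'iterations')
```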

final train and export to .csv

In [53]:
pred_y = final_train_and_predict(ann_gs.best_estimator_,train_x,train_y,test_x,test_y,real_life_test,[])
export_to_csv(pred_y,passengers_id,'MlpWithoutCluster')

with clustering :

The Tuned model

In [79]:
#optimization to Neural Nets model
ann_gs = GridSearchCV(ann, param_grid,cv=5, iid=False)

ann_gs.fit(train_x_with_cluster, train_y)

pred_y = ann_gs.predict(test_x_with_cluster)
accuracy = accuracy_score(test_y, pred_y)
print("Accuracy: " + str(accuracy))
Accuracy: 0.7940074906367042

final train and export to .csv

In [80]:
pred_y = final_train_and_predict(ann_gs.best_estimator_,train_x_with_cluster,train_y,test_x_with_cluster,test_y,real_life_test_with_cluster,[])
export_to_csv(pred_y,passengers_id,'MlpWithCluster')

PCA

without clustering :

initiate PCA and classifier :

In [138]:
pca = PCA()

rfc  = RandomForestClassifier(random_state= RSEED*2, n_estimators=10)
rfc.fit(train_x,train_y)

param_grid = {
    'n_estimators': [10,100, 200, 250] , # The number of trees in the forest.
    'max_depth':    [None, 50, 60, 70] , # The maximum depth of the tree.
    'max_features': ['sqrt', None], # The number of features to consider when looking for the best split
    'min_samples_split': [2, 10], # The minimum number of samples required to split an internal node
    'bootstrap': [True, False] , # Whether bootstrap samples are used when building trees.
    'random_state':[RSEED*2]
}

classifier = GridSearchCV(rfc, param_grid, n_jobs = -1,scoring = 'accuracy', cv = 5 ,iid=False)

transform / fit the data

In [139]:
X_transformed = pca.fit_transform(train_x)
classifier.fit(X_transformed, train_y)
Out[139]:
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=10, n_jobs=None,
                                              oob_score=False, random_state=20,
                                              verbose=0, warm_start=False),
             iid=False, n_jobs=-1,
             param_grid={'bootstrap': [True, False],
                         'max_depth': [None, 50, 60, 70],
                         'max_features': ['sqrt', None],
                         'min_samples_split': [2, 10],
                         'n_estimators': [10, 100, 200, 250],
                         'random_state': [20]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)

transform test data using already fitted PCA

In [140]:
newdata_transformed = pca.transform(test_x)

predict labels using the trained classifier

In [141]:
pred_labels = classifier.predict(newdata_transformed)
score = accuracy_score(test_y,pred_labels)
print("Accuracy : ", score)
Accuracy :  0.8352059925093633
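`PCA()` with no arguments keeps all components; inspecting `explained_variance_ratio_` shows how much variance each component carries and can guide an `n_components` choice. A minimal sketch on synthetic data with one dominant direction:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(10)
X = rng.normal(size=(200, 8))
X[:, 0] *= 5  # make one direction dominate the variance

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_  # sorted in decreasing order
print(np.round(ratios, 3))
print('components for 95% variance:',
      int(np.searchsorted(np.cumsum(ratios), 0.95) + 1))
```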

final train and export to .csv

In [142]:
pred_y = final_train_and_predict(classifier,train_x,train_y,test_x,test_y,real_life_test,[])
export_to_csv(pred_y,passengers_id,'PcaWithoutCluster')

with clustering :

In [135]:
classifier = GridSearchCV(rfc, param_grid, n_jobs = -1,scoring = 'accuracy', cv = 5 ,iid=False)
X_transformed = pca.fit_transform(train_x_with_cluster)
classifier.fit(X_transformed, train_y)
newdata_transformed = pca.transform(test_x_with_cluster)
In [136]:
pred_labels = classifier.predict(newdata_transformed)
score = accuracy_score(test_y,pred_labels)
print("Accuracy : ", score)
Accuracy :  0.8539325842696629

final train and export to .csv

In [137]:
pred_y = final_train_and_predict(classifier,train_x_with_cluster,train_y,test_x_with_cluster,test_y,real_life_test_with_cluster,[])
export_to_csv(pred_y,passengers_id,'PcaWithCluster')

XGboost

without clustering :

The Base model

In [155]:
xgb = XGBClassifier( random_state = RSEED)
xgb.fit(train_x,train_y)
pred_y = xgb.predict(test_x)
score = accuracy_score(test_y,pred_y)
print("Accuracy : ", score)
Accuracy :  0.797752808988764

The Tuned model

In [156]:
parameters = {
    'max_depth': range (2, 15, 1),
    'n_estimators': [10,50,100,250,300,350,400],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'min_child_weight': [1, 5, 10],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'learning_rate': [0.1, 0.5,0.01, 0.05,0.005,0.001]
}

xgb = XGBClassifier(random_state=RSEED)
clf = GridSearchCV(xgb, parameters, n_jobs=-1, cv=2, scoring='accuracy',verbose=2, refit=True)
In [157]:
clf.fit(train_x,train_y)
Fitting 2 folds for each of 24570 candidates, totalling 49140 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    8.7s
[Parallel(n_jobs=-1)]: Done 479 tasks      | elapsed:   23.2s
[Parallel(n_jobs=-1)]: Done 1291 tasks      | elapsed:   48.4s
[Parallel(n_jobs=-1)]: Done 2423 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 3883 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 5663 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done 7771 tasks      | elapsed:  4.5min
[Parallel(n_jobs=-1)]: Done 9457 tasks      | elapsed:  5.7min
[Parallel(n_jobs=-1)]: Done 10835 tasks      | elapsed:  6.6min
[Parallel(n_jobs=-1)]: Done 12373 tasks      | elapsed:  7.6min
[Parallel(n_jobs=-1)]: Done 14075 tasks      | elapsed:  8.7min
[Parallel(n_jobs=-1)]: Done 15937 tasks      | elapsed: 10.0min
[Parallel(n_jobs=-1)]: Done 17963 tasks      | elapsed: 11.5min
[Parallel(n_jobs=-1)]: Done 19704 tasks      | elapsed: 13.2min
[Parallel(n_jobs=-1)]: Done 20879 tasks      | elapsed: 14.1min
[Parallel(n_jobs=-1)]: Done 22134 tasks      | elapsed: 15.1min
[Parallel(n_jobs=-1)]: Done 23471 tasks      | elapsed: 16.3min
[Parallel(n_jobs=-1)]: Done 24888 tasks      | elapsed: 17.5min
[Parallel(n_jobs=-1)]: Done 27072 tasks      | elapsed: 19.1min
[Parallel(n_jobs=-1)]: Done 30230 tasks      | elapsed: 21.6min
[Parallel(n_jobs=-1)]: Done 33552 tasks      | elapsed: 24.2min
[Parallel(n_jobs=-1)]: Done 37034 tasks      | elapsed: 27.2min
[Parallel(n_jobs=-1)]: Done 40680 tasks      | elapsed: 30.3min
[Parallel(n_jobs=-1)]: Done 44486 tasks      | elapsed: 33.6min
[Parallel(n_jobs=-1)]: Done 48456 tasks      | elapsed: 37.1min
[Parallel(n_jobs=-1)]: Done 49140 out of 49140 | elapsed: 37.7min finished
Out[157]:
GridSearchCV(cv=2, error_score='raise-deprecating',
             estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None, gamma=None,
                                     gpu_id=None, importance_type='gain',
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=None,
                                     missing=nan, monotone_constrai...
                                     validate_parameters=None, verbosity=None),
             iid='warn', n_jobs=-1,
             param_grid={'colsample_bytree': [0.6, 0.8, 1.0],
                         'gamma': [0.5, 1, 1.5, 2, 5],
                         'learning_rate': [0.1, 0.5, 0.01, 0.05, 0.005, 0.001],
                         'max_depth': range(2, 15),
                         'min_child_weight': [1, 5, 10],
                         'n_estimators': [10, 50, 100, 250, 300, 350, 400]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=2)
In [159]:
score = accuracy_score(test_y,clf.predict(test_x))
print("Accuracy : ", score)
Accuracy :  0.8651685393258427
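The full grid above has 24,570 candidates and took ~38 minutes; `RandomizedSearchCV` samples a fixed number of candidates from the same space and often lands close to the grid optimum in a fraction of the time. A hedged sketch on synthetic data, using sklearn's GradientBoostingClassifier as a stand-in for XGBClassifier so the snippet runs without xgboost installed (the parameter names differ slightly from XGBoost's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=10)

param_distributions = {
    'max_depth': range(2, 15),
    'n_estimators': [10, 50, 100, 250],
    'learning_rate': [0.1, 0.05, 0.01],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=10),
    param_distributions,
    n_iter=20,          # 20 sampled candidates instead of the full grid
    cv=2,
    scoring='accuracy',
    random_state=10,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```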

final train and export to .csv

In [166]:
pred_y = final_train_and_predict(clf.best_estimator_,train_x,train_y,test_x,test_y,real_life_test,[])
export_to_csv(pred_y,passengers_id,'XGBoostWithoutCluster')

with clustering :

The Base model

In [178]:
xgb = XGBClassifier(random_state = RSEED)
xgb.fit(train_x_with_cluster,train_y)
pred_y = xgb.predict(test_x_with_cluster)
score = accuracy_score(test_y,pred_y)
print("Accuracy : ", score)
Accuracy :  0.8014981273408239

The Tuned model

In [179]:
xgb = XGBClassifier(random_state=RSEED)
clf = GridSearchCV(xgb, parameters, n_jobs=-1, cv=2, scoring='accuracy',verbose=2, refit=True)
In [180]:
clf.fit(train_x_with_cluster,train_y)
Fitting 2 folds for each of 24570 candidates, totalling 49140 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    9.3s
[Parallel(n_jobs=-1)]: Done 257 tasks      | elapsed:   16.7s
[Parallel(n_jobs=-1)]: Done 663 tasks      | elapsed:   32.5s
[Parallel(n_jobs=-1)]: Done 1229 tasks      | elapsed:   52.9s
[Parallel(n_jobs=-1)]: Done 1771 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 2216 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 3261 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 4475 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 5853 tasks      | elapsed:  4.6min
[Parallel(n_jobs=-1)]: Done 7391 tasks      | elapsed:  5.7min
[Parallel(n_jobs=-1)]: Done 9093 tasks      | elapsed:  6.9min
[Parallel(n_jobs=-1)]: Done 10955 tasks      | elapsed:  8.2min
[Parallel(n_jobs=-1)]: Done 12981 tasks      | elapsed:  9.5min
[Parallel(n_jobs=-1)]: Done 15167 tasks      | elapsed: 11.0min
[Parallel(n_jobs=-1)]: Done 17517 tasks      | elapsed: 13.0min
[Parallel(n_jobs=-1)]: Done 20027 tasks      | elapsed: 15.1min
[Parallel(n_jobs=-1)]: Done 22701 tasks      | elapsed: 17.3min
[Parallel(n_jobs=-1)]: Done 25535 tasks      | elapsed: 19.8min
[Parallel(n_jobs=-1)]: Done 27578 tasks      | elapsed: 22.0min
[Parallel(n_jobs=-1)]: Done 29157 tasks      | elapsed: 23.4min
[Parallel(n_jobs=-1)]: Done 30818 tasks      | elapsed: 24.8min
[Parallel(n_jobs=-1)]: Done 32559 tasks      | elapsed: 26.9min
[Parallel(n_jobs=-1)]: Done 34382 tasks      | elapsed: 29.0min
[Parallel(n_jobs=-1)]: Done 36285 tasks      | elapsed: 31.1min
[Parallel(n_jobs=-1)]: Done 38270 tasks      | elapsed: 33.1min
[Parallel(n_jobs=-1)]: Done 40335 tasks      | elapsed: 35.2min
[Parallel(n_jobs=-1)]: Done 42482 tasks      | elapsed: 37.2min
[Parallel(n_jobs=-1)]: Done 44709 tasks      | elapsed: 39.1min
[Parallel(n_jobs=-1)]: Done 47018 tasks      | elapsed: 41.2min
[Parallel(n_jobs=-1)]: Done 49140 out of 49140 | elapsed: 43.0min finished
Out[180]:
GridSearchCV(cv=2, error_score='raise-deprecating',
             estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None, gamma=None,
                                     gpu_id=None, importance_type='gain',
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=None,
                                     missing=nan, monotone_constrai...
                                     validate_parameters=None, verbosity=None),
             iid='warn', n_jobs=-1,
             param_grid={'colsample_bytree': [0.6, 0.8, 1.0],
                         'gamma': [0.5, 1, 1.5, 2, 5],
                         'learning_rate': [0.1, 0.5, 0.01, 0.05, 0.005, 0.001],
                         'max_depth': range(2, 15),
                         'min_child_weight': [1, 5, 10],
                         'n_estimators': [10, 50, 100, 250, 300, 350, 400]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=2)
In [182]:
pred_y = clf.predict(test_x_with_cluster)
score = accuracy_score(test_y,pred_y)
print("Accuracy : ", score)
Accuracy :  0.850187265917603

final train and export to .csv

In [183]:
pred_y = final_train_and_predict(clf.best_estimator_,train_x_with_cluster,train_y,test_x_with_cluster,test_y,real_life_test_with_cluster,[])
export_to_csv(pred_y,passengers_id,'XGBoostWithCluster')

Accuracy results

In [86]:
%%html
<div class='tableauPlaceholder' id='viz1593447555580' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Bo&#47;Book1_15934475387770&#47;Dashboard1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Book1_15934475387770&#47;Dashboard1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Bo&#47;Book1_15934475387770&#47;Dashboard1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1593447555580');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.minWidth='420px';vizElement.style.maxWidth='1350px';vizElement.style.width='100%';vizElement.style.minHeight='587px';vizElement.style.maxHeight='887px';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.minWidth='420px';vizElement.style.maxWidth='1350px';vizElement.style.width='100%';vizElement.style.minHeight='587px';vizElement.style.maxHeight='887px';vizElement.style.height=(divElement.offsetWidth*0.35)+'px';} else { vizElement.style.width='100%';vizElement.style.height='600px';}                     var scriptElement = document.createElement('script'); 
                   scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

Conclusion

  1. We assume the difference between train and test results is due to the test set differing from the train set; at some point our models overfitted the train set, since we achieved high-accuracy predictions on it.
  2. The average result of the tree-based models was about 2% higher than the average of the other models.
  3. In most models clustering was not efficient; it gave us the same results.
  4. In the cases where clustering helped, it increased accuracy by approximately 1%.
  5. SVM would give us better results if we had enough time to optimize it well.
  6. High cv values helped us increase accuracy and avoid overfitting in some cases.
  7. For this problem and similar problems, it is best to use decision tree-based models.
  8. Even in the models where we included the cluster as a feature, it was not an important one, and some models advised dropping it to increase accuracy (see the optimize_by_droping_features method and Random forest).
  9. On average we achieved a 4.43% accuracy increase after tuning the models (compare each base model with its tuned model).
  10. Higher accuracy on the train set does not necessarily mean higher accuracy on the test set (for example, comparing decision tree to random forest).
  11. In models that expose feature importance, the Parch feature was almost always irrelevant.
  12. There were no perfect correlations or linear relations between features, so we mostly used all of the features and dropped only the unnecessary ones.

Problems we faced :

  1. There were a lot of missing values in many features, so we had to fill them in a smart way rather than replacing them with constant values.
  2. It was hard to understand the data without graphs, so we created high-quality, responsive graphs using Tableau and embedded them in the notebook, which gave us a better view of the data.
  3. We had to research each model's hyperparameters in order to know how to tune it.
  4. A big part of the code was duplicated across the models, so we wrote functions to keep the code clean and simple.
  5. Different random seeds gave us different scores, so we had to try more than one and pick the best. We tried as much as possible to use the same seed for all the models, but some models gave better results with other seeds, so we changed it.
  6. Tuning the models took a lot of time, so we worked on the code logic all day and ran optimization all night; achieving a good result took a long time.
  7. Uploading the models' predictions to Kaggle took a lot of time; we used to upload them in the early morning or after midnight to avoid server lag.
  8. Kaggle has a daily upload limit, and we wanted to check our predictions continuously after every change, so we opened 5 accounts to upload files as often as we needed.
  9. We spent 20% of our time working on the exercise and 80% waiting for the code to run.

Export to html :

In [92]:
!jupyter nbconvert --to html Titanic.ipynb
[NbConvertApp] Converting notebook Titanic.ipynb to html
[NbConvertApp] Writing 4302249 bytes to Titanic.html